
In the following, we prove that the Hessian matrix of the loss function is directly related to the expectation of the covariance of the gradient. Taking the loss function as the negative logarithm of the likelihood, let $X$ be a set of input data for the network and $p(X; \hat{w}, \hat{\alpha})$ be the predicted distribution on $X$ under the network parameters $\hat{w}$ and $\hat{\alpha}$, i.e., the output logits of the head layer.

By omitting $\hat{w}$ for simplicity, the Fisher information of the set of probability distributions $P = \{p_n(X; \hat{\alpha}),\ n = 1, \ldots, N\}$ can be described by a matrix whose entry in the $i$-th row and $j$-th column is

$$
I_{i,j}(\hat{\alpha}) = \mathbb{E}_X\!\left[\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i}\,\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_j}\right]. \tag{4.36}
$$

Recall that $N$ denotes the number of classes described in Eq. 4.21. It is then trivial to prove that the Fisher information of the probability distribution set $P$ approaches a scaled version of the Hessian of the log-likelihood as

$$
I_{i,j}(\hat{\alpha}) = -\mathbb{E}_X\!\left[\frac{\partial^2 \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i\, \partial \hat{\alpha}_j}\right]. \tag{4.37}
$$
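The identity in Eq. 4.37 is easy to verify numerically. The following toy sketch (a hypothetical example, not part of the DCP-NAS code) treats $\hat{\alpha}$ as the logits of a categorical distribution over $N$ classes and checks that the expected outer product of gradients of the log-likelihood equals the negative expected Hessian.

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)

# Hypothetical toy setting: alpha plays the role of the architecture
# parameters and parameterizes a categorical distribution over N classes.
N = 4
alpha = torch.randn(N, requires_grad=True)

def log_p(a, n):
    """Log-probability of class n under softmax(a)."""
    return torch.log_softmax(a, dim=0)[n]

probs = torch.softmax(alpha, dim=0).detach()

fisher = torch.zeros(N, N)           # E_n[ grad log p_n  grad log p_n^T ]
neg_exp_hessian = torch.zeros(N, N)  # -E_n[ d^2 log p_n / (d a_i d a_j) ]
for n in range(N):
    (g,) = torch.autograd.grad(log_p(alpha, n), alpha)
    fisher += probs[n] * torch.outer(g, g)
    h = hessian(lambda a: log_p(a, n), alpha.detach())
    neg_exp_hessian -= probs[n] * h

print(torch.allclose(fisher, neg_exp_hessian, atol=1e-5))  # True
```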

Let $H_{i,j}$ denote the second-order partial derivative operator $\frac{\partial^2}{\partial \hat{\alpha}_i\, \partial \hat{\alpha}_j}$. Note that the first derivative of the log-likelihood is

$$
\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i} = \frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\, \partial \hat{\alpha}_i}. \tag{4.38}
$$

The second derivative is

$$
H_{i,j}\, \log p_n(X; \hat{\alpha}) = \frac{H_{i,j}\, p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})} - \frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\, \partial \hat{\alpha}_i} \cdot \frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\, \partial \hat{\alpha}_j}. \tag{4.39}
$$

Considering that

$$
\mathbb{E}_X\!\left(\frac{H_{i,j}\, p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})}\right) = \int \frac{H_{i,j}\, p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})}\, p_n(X; \hat{\alpha})\, dX = H_{i,j}\!\int p_n(X; \hat{\alpha})\, dX = 0, \tag{4.40}
$$

we take the expectation of the second derivative and then obtain the following.

$$
\begin{aligned}
\mathbb{E}_X\big(H_{i,j}\, \log p_n(X; \hat{\alpha})\big)
&= -\mathbb{E}_X\!\left\{\frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\, \partial \hat{\alpha}_i} \cdot \frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\, \partial \hat{\alpha}_j}\right\} \\
&= -\mathbb{E}_X\!\left\{\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i} \cdot \frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_j}\right\}.
\end{aligned}
\tag{4.41}
$$

Thus, an equivalent substitution for the Hessian matrix $\tilde{H}_{f_b}(\hat{\alpha})$ in Eq. 4.32 is the expectation of the product of two first-order derivatives of the log-likelihood. This concludes the proof that we can use the covariance of gradients to represent the Hessian matrix for efficient computation.
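In implementation terms, the Hessian with respect to the architecture parameters therefore never needs to be formed from second derivatives; it can be approximated by accumulating outer products of per-sample gradients of the log-likelihood. Below is a minimal PyTorch-style sketch of that substitution; `model`, `arch_params`, `data`, and `targets` are placeholder names, and the sketch is not taken from the DCP-NAS implementation.

```python
import torch
import torch.nn.functional as F

def empirical_fisher(model, arch_params, data, targets):
    """Approximate the Hessian w.r.t. the architecture parameters by the
    covariance of gradients (Eq. 4.41): the average outer product of
    per-sample gradients of the log-likelihood.

    Assumes `model(x)` returns class logits that depend on `arch_params`
    (a single tensor of architecture parameters).
    """
    d = arch_params.numel()
    fisher = torch.zeros(d, d, device=arch_params.device)
    for x, y in zip(data, targets):
        log_p = F.log_softmax(model(x.unsqueeze(0)), dim=-1)
        nll = F.nll_loss(log_p, y.unsqueeze(0))       # per-sample -log p
        (g,) = torch.autograd.grad(nll, arch_params)  # first derivatives only
        g = g.reshape(-1)
        fisher += torch.outer(g, g)
    return fisher / max(len(data), 1)
```

A full $d \times d$ matrix of this kind is only affordable when the number of architecture parameters is small; a common simplification is to keep only its diagonal, i.e., the per-parameter second moment of the gradients.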

4.4.6 Decoupled Optimization for Training the DCP-NAS

In this section, we first describe the coupling relationship between the weights and the architecture parameters in the DCP-NAS. Then we present the decoupled optimization applied during backpropagation of the sampled supernet to fully and effectively optimize these two coupled sets of parameters.

Coupled models for DCP-NAS
Combining Eq. 4.27 and Eq. 4.28, we first show how the parameters in DCP-NAS are formulated in a coupling relationship as